1.  Summary


The primary goal of this research was the development of an enabling technology for
DNA computing. It focuses on the construction of a biomolecular architecture
designed to employ new algorithmic paradigms based on the massively parallel
computational power of DNA hybridization. The intent is to develop a computing basis to
eventually overcome the exponential time complexity of many discrete math problems so
that they can be solved in linear real time. Many of these computationally hard (NP)
problems are critical to logistics, scheduling and security. In this way, this research
addresses computational, national security and knowledge acquisition challenges of the Air
Force. In this report we:

1.  Give  new  metrics  for  cooperative  DNA  code  design  that  capture  key  aspects  of  the 
nearest  neighbor  thermodynamic  model  for  hybridized  DNA  duplexes. 

2.  Show  how  DNA  computing  can  be  applied  to  the  identification  of  independent  sets  in 
a  graph. 

3.  Show  how  our  software  uses  our  new  metrics  to  construct  a  biomolecular  computing 
architecture  to  address  the  identification  of  independent  sets  in  a  graph. 

2.  Introduction 

DNA codewords are structural and information building blocks in biomolecular
computing and other biotechnical applications that employ DNA hybridization assays.
Thermodynamic distance functions are important components in the construction of DNA
codes. We introduce new metrics for DNA code design that capture key aspects of the
nearest neighbor thermodynamic model for hybridized DNA duplexes. One version of
our metric gives the maximum number of stacked pairs of hydrogen bonded nucleotide
base pairs that can be present in any secondary structure in a hybridized DNA duplex
without pseudoknots. We introduce the concept of (t-gap) block isomorphic
subsequences to describe new string metrics that are similar to the weighted Levenshtein
insertion-deletion metric. We show how our new distances can be calculated by a
generalization of the folklore longest common subsequence dynamic programming
algorithm. We give a Varshamov-Gilbert-like lower bound on the size of some codes
using our distance functions as constraints. We also discuss software implementation of
our DNA code design methods.

In this paper, all variables are nonnegative integers unless otherwise stated. $[n]$
denotes the set $\{0,\ldots,n-1\}$ and $(n)$ denotes the sequence $1,2,\ldots,n$. Given two
sequences $\alpha$ and $\beta$, we write $\alpha \prec \beta$ if and only if $\alpha$ is a
subsequence of $\beta$. The length of sequence $\alpha$ is denoted by $|\alpha|$. We call
$\alpha \prec (n)$ a string if and only if it is a subsequence of consecutive integers, e.g.,
$\alpha = i, i+1, \ldots, i+k$. For $a \le b$, we use the notation $[a,b]$ for the string of
integers between and including $a$ and $b$. If $a = b$, we sometimes write $[a]$ for $[a,b]$.
When we write

$\sigma = [a_1,b_1],[a_2,b_2],\ldots,[a_i,b_i],\ldots,[a_k,b_k]$ where $a_i \le b_i$ and $a_{i+1} > b_i$, we mean

$\sigma = a_1, a_1+1, \ldots, b_1, a_2, a_2+1, \ldots, b_2, \ldots, a_i, a_i+1, \ldots, b_i, \ldots, a_k, a_k+1, \ldots, b_k$.

For $\sigma \prec (n)$, $\tau \prec (m)$ with $|\sigma| \le |\tau|$, we write $f : \sigma \to \tau$ to
indicate an increasing function $f : \{i : i \in \sigma\} \to \{i : i \in \tau\}$. Given
$\sigma = i_1, i_2, \ldots, i_k$ and $f : \sigma \to \tau$, we define
$f(\sigma) = f(i_1), f(i_2), \ldots, f(i_k)$. If $|\sigma| = |\tau|$, then $f : \sigma \to \tau$ is
unique. We let $[4]^n$ denote the set of sequences of length $n$ with entries in $[4]$. We
identify $\{0,1,2,3\}$ with the symbols for DNA bases $\{A,C,G,T\}$. Thus sequences in $[4]^n$
are all DNA sequences of length $n$. For $x = x_1,\ldots,x_n$ with $x \in [4]^n$ and
$\sigma = i_1, i_2, \ldots, i_k$ where $\sigma \prec (n)$, we let $x_\sigma \prec x$ be the
subsequence $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$. Given a non-negative real-valued function,
$\Omega$, on $[q]$, we define $\|x_\sigma\|_\Omega = \sum_{i \in \sigma} \Omega(x_i)$.
3.  Methods, Assumptions, Procedures


3.1  Block  Isomorphic  Subsequences 

Definition 1. For $\sigma \prec (n)$, a substring $\beta \prec \sigma$ is called a block of
$\sigma$ if $\beta$ is not a subsequence of any substring $\alpha$ of $\sigma$ with
$\beta \ne \alpha$. A subsequence $x_\beta \prec x_\sigma$ is called a block of $x_\sigma$ if
$\beta$ is a block of $\sigma$. For $\sigma \prec (n)$, let $(\beta_i(\sigma))$ be the sequence
of blocks of $\sigma$, where each element of $\beta_i(\sigma)$ is less than every element of
$\beta_{i+1}(\sigma)$. Let $b_i(\sigma)$ and $B_i(\sigma)$ be the left and right endpoints of
$\beta_i(\sigma)$. When the context is clear, we just write $b_i$, $B_i$ and $\beta_i$
respectively. When we write $\sigma = ([b_i,B_i]) = (\beta_i)$, we say that it is the block
representation of $\sigma$. When we write $x_\sigma = x_{(\beta_i)}$ we say that it is the
block representation of $x_\sigma$.

Block representations are unique. Let $x \in [4]^{14}$ and let $\sigma \prec (14)$. Suppose
$x = 2,0,1,2,2,3,0,3,2,0,0,1,3,2$ and $\sigma = 2,3,4,7,9,10,13,14$. Then
$x_\sigma = 0,1,2,0,2,0,3,2$ and the block representations are:
$\sigma = [2,4],[7,7],[9,10],[13,14]$ and $x_\sigma = x_{[2,4],[7,7],[9,10],[13,14]}$.

Definition 2. For $2 \le t \le n-1$, we define

$G_t(n) = \{\sigma \prec (n) : b_{i+1}(\sigma) - B_i(\sigma) \ge t\}$.

We call $G_t(n)$ the set of t-gap sequences of $(n)$.

Note that $\sigma \prec (n) \Rightarrow \sigma \in G_2(n)$ and
$2 \le t_1 \le t_2 \le n-1 \Rightarrow G_{t_2}(n) \subseteq G_{t_1}(n)$.
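The block representation and the t-gap condition can be illustrated with a short Python sketch. The function names `blocks` and `in_gap_set` are ours and purely illustrative, and the gap condition is read as $b_{i+1} - B_i \ge t$; the data are those of the example following Definition 1.

```python
def blocks(sigma):
    """Block representation [(b_1, B_1), ...] of an increasing index sequence."""
    out = []
    for i in sigma:
        if out and i == out[-1][1] + 1:
            out[-1] = (out[-1][0], i)   # extend the current block
        else:
            out.append((i, i))          # start a new block
    return out


def in_gap_set(sigma, t):
    """True if consecutive blocks satisfy b_{i+1} - B_i >= t (assumed reading of G_t)."""
    bl = blocks(sigma)
    return all(nxt[0] - cur[1] >= t for cur, nxt in zip(bl, bl[1:]))


x = [2, 0, 1, 2, 2, 3, 0, 3, 2, 0, 0, 1, 3, 2]
sigma = [2, 3, 4, 7, 9, 10, 13, 14]
x_sigma = [x[i - 1] for i in sigma]     # indices in the text are 1-based

print(blocks(sigma))        # [(2, 4), (7, 7), (9, 10), (13, 14)]
print(x_sigma)              # [0, 1, 2, 0, 2, 0, 3, 2]
print(in_gap_set(sigma, 2), in_gap_set(sigma, 3))   # True False
```

Every index sequence passes the test for $t = 2$ because blocks are maximal runs; the sequence above fails for $t = 3$ because the blocks $[7,7]$ and $[9,10]$ are only 2 apart.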
Definition 3. Let $\sigma \prec (n)$, $\tau \prec (m)$ with $|\sigma| = |\tau|$. Let
$f : \sigma \to \tau$ be unique. We say that $\sigma$ and $\tau$ are block isomorphic and write
$\sigma \cong \tau$ if: $\alpha \prec \sigma$ is a string $\Leftrightarrow$ $f(\alpha) \prec \tau$
is a string. For $x \in [q]^n$, $y \in [q]^m$ we say that $x_\sigma$ and $y_\tau$ are block
isomorphic, denoted by $x_\sigma \cong y_\tau$, if $x_\sigma = y_\tau$ and $\sigma \cong \tau$.


Proposition 1. Suppose $\sigma \cong \tau$ and $f : \sigma \to \tau$ is unique. Then
$\beta \prec \sigma$ is a block in $\sigma$ if and only if $f(\beta)$ is a block in $\tau$.


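Definition 4. Let $2 \le t \le \min(n,m) - 1$ and suppose $\sigma \prec (n)$, $\tau \prec (m)$ with
$|\sigma| = |\tau|$. We say that $\sigma$ and $\tau$ are t-gap block isomorphic and write
$\sigma \cong_t \tau$ if and only if $\sigma \in G_t(n)$, $\tau \in G_t(m)$ and
$\sigma \cong \tau$. For $x \in [q]^n$, $y \in [q]^m$, we say that $x_\sigma$ and $y_\tau$ are
t-gap block isomorphic, denoted by $x_\sigma \cong_t y_\tau$, if and only if
$x_\sigma = y_\tau$ and $\sigma \cong_t \tau$. We say that $x$ and $y$ have a common t-gap
block isomorphic subsequence if and only if there are $\sigma \prec (n)$, $\tau \prec (m)$ with
$x_\sigma \cong_t y_\tau$. Note that $\sigma \cong \tau \Leftrightarrow \sigma \cong_2 \tau$ and
$\sigma \cong_{t_2} \tau \Rightarrow \sigma \cong_{t_1} \tau$ when $t_1 \le t_2$. So, we just
write $\sigma \cong \tau$ and $x_\sigma \cong y_\tau$ to denote $\sigma \cong_2 \tau$ and
$x_\sigma \cong_2 y_\tau$ respectively.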


Example 1. Let $x,y,z,w \in [4]^{13}$ and $\sigma_i \prec (13)$ with
x = 1,1,2,0,2,3,3,0,1,1,2,0,1; y = 2,0,2,3,3,0,1,1,1,1,2,0,3;
z = 3,2,0,2,1,1,1,1,1,0,3,2,0; w = 1,1,1,2,0,3,2,0,2,0,3,3,3.
Let $\sigma_1 = [3,5],[8,8],[11,12]$; $\sigma_2 = [1,3],[6,6],[11,12]$;
$\sigma_3 = [2,4],[10,10],[12,13]$; $\sigma_4 = [4,5],[7,10]$. Then
$x_{\sigma_1} = y_{\sigma_2} = z_{\sigma_3} = w_{\sigma_4} = 2,0,2,0,2,0$. Since
$\sigma_1 \cong \sigma_2 \cong \sigma_3 \not\cong \sigma_4$, we have that
$x_{\sigma_1} \cong y_{\sigma_2} \cong z_{\sigma_3} \not\cong w_{\sigma_4}$. Since
$\sigma_1, \sigma_2 \in G_3(n)$ and $\sigma_3 \notin G_3(n)$, we have that
$x_{\sigma_1} \cong_3 y_{\sigma_2} \not\cong_3 z_{\sigma_3}$.
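The relations in Example 1 can be checked mechanically. The sketch below is only an illustration: it relies on the observation (a consequence of Proposition 1) that two equal-length index sequences are block isomorphic exactly when their blocks have the same lengths in the same order, and it reads the t-gap condition as $b_{i+1} - B_i \ge t$. The helper names `blocks`, `expand` and `block_isomorphic` are ours.

```python
def blocks(sigma):
    """Block representation [(b, B), ...] of an increasing index sequence."""
    out = []
    for i in sigma:
        if out and i == out[-1][1] + 1:
            out[-1] = (out[-1][0], i)
        else:
            out.append((i, i))
    return out


def expand(block_rep):
    """Index sequence b, b+1, ..., B for each block (b, B) in order."""
    return [i for b, B in block_rep for i in range(b, B + 1)]


def block_isomorphic(sigma, tau, t=2):
    """True if sigma and tau are t-gap block isomorphic (t = 2 is plain isomorphism)."""
    bs, bt = blocks(sigma), blocks(tau)
    same_shape = [B - b for b, B in bs] == [B - b for b, B in bt]

    def gaps_ok(bl):
        return all(nxt[0] - cur[1] >= t for cur, nxt in zip(bl, bl[1:]))

    return same_shape and gaps_ok(bs) and gaps_ok(bt)


s1 = expand([(3, 5), (8, 8), (11, 12)])
s2 = expand([(1, 3), (6, 6), (11, 12)])
s3 = expand([(2, 4), (10, 10), (12, 13)])
s4 = expand([(4, 5), (7, 10)])

print(block_isomorphic(s1, s2), block_isomorphic(s1, s3), block_isomorphic(s1, s4))
print(block_isomorphic(s1, s2, t=3), block_isomorphic(s1, s3, t=3))
```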

3.2  Block  Insertion-Deletion  Codes 

Definition 5. For $x,y \in [q]^n$, we define:
$\rho_{\Omega,q}(x,y) = \max\{\|z\|_\Omega : z \prec x \text{ and } z \prec y\}$
and $L_{\Omega,q}(x,y) = \min(\|x\|_\Omega, \|y\|_\Omega) - \rho_\Omega(x,y)$. We say that
$\rho_\Omega(x,y)$ is the maximum weight of a common subsequence to $x$ and $y$ and
$L_\Omega(x,y)$ is called the weighted Levenshtein insertion-deletion distance.
$L_\Omega(x,y)$ is a metric. When $\|x_\sigma\|_\Omega = |\sigma|$, we
write $L(x,y)$ for $L_\Omega(x,y)$.
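As a point of reference, the maximum-weight common subsequence and the resulting insertion-deletion distance can be computed with the standard longest common subsequence dynamic program. The following is a minimal sketch, not the authors' software; the function names and the `weight` callback (playing the role of $\Omega$, defaulting to unit weights) are our own choices.

```python
def rho(x, y, weight=lambda s: 1):
    """Maximum weight of a common subsequence of x and y (weighted LCS)."""
    m, n = len(x), len(y)
    R = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            R[i][j] = max(R[i - 1][j], R[i][j - 1])
            if x[i - 1] == y[j - 1]:
                R[i][j] = max(R[i][j], R[i - 1][j - 1] + weight(x[i - 1]))
    return R[m][n]


def levenshtein_indel(x, y, weight=lambda s: 1):
    """min(||x||, ||y||) minus the maximum common-subsequence weight."""
    wx = sum(weight(s) for s in x)
    wy = sum(weight(s) for s in y)
    return min(wx, wy) - rho(x, y, weight)


x = [0, 1, 2, 3, 1, 3, 0, 1]
y = [3, 0, 1, 3, 2, 0, 3, 1]
print(rho(x, y), levenshtein_indel(x, y))   # 5 3
```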

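Definition 6. For $2 \le t \le n-1$ and $x,y \in [q]^n$, we define:

$\varphi^t_{\Omega,q}(x,y) = \max\{\|x_\sigma\|_\Omega : x_\sigma \cong_t y_\tau\}$,

$\Phi^t_{\Omega,q}(x,y) \equiv \min(\|x\|_\Omega, \|y\|_\Omega) - \varphi^t_{\Omega,q}(x,y)$.

When the context for $q$ is clear, we simply write $\varphi^t_\Omega(x,y)$ and
$\Phi^t_\Omega(x,y)$. We say that $\varphi^t_\Omega(x,y)$ is the weight of the longest common
t-gap block subsequence of $x$ and $y$. When $\|x_\sigma\|_\Omega = |\sigma|$, we write
$\varphi^t(x,y)$ and $\Phi^t(x,y)$ for $\varphi^t_\Omega(x,y)$ and $\Phi^t_\Omega(x,y)$
respectively. For $t = 1$, we define $\varphi^1_\Omega(x,y) = \rho_\Omega(x,y)$ and
$\Phi^1(x,y) = L(x,y)$.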

Proposition 2. $\Phi^t_\Omega(x,y)$ is a metric on $[q]^n$.

Definition 7. A t-gap block insertion-deletion q-ary code of $\Omega$ weighted distance $d$ is a
subset $C$ of $[q]^n$ such that: $x \ne y \in C \Rightarrow \Phi^t_\Omega(x,y) \ge d$.

Theorem 1. Let $\|x_\sigma\|_\Omega = |\sigma|$. In $[q]^n$, there is a 2-gap block
insertion-deletion code $C$ of distance $d = n - k$ with
ici>?‘(eO:;)(t)2)


4.  Results,  Discussion 

4.1  Computing $\Phi^t_\Omega(x,y)$

For $1 \le m \le n$, consider the string $[m+1,n]$. For $x,y \in [q]^n$, let $\mathrm{suf}(x,y)$ be
the length of the longest common suffix between $x$ and $y$. Then $\mathrm{suf}(x,y) = 0$ if
$x_n \ne y_n$ and $\mathrm{suf}(x,y) = n - m$ if $x_{[m+1,n]} = y_{[m+1,n]}$ and $x_m \ne y_m$.
For $x \in [q]^n$, we have that $x_{[1,i]}$ is the first $i$ entries of $x$ and $x_{[1,n]} = x$.

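Proposition 3. Let $1 \le t \le n-1$. For $x,y \in [q]^n$ and $1 \le i,j \le n$, define
$M^t_{\Omega,i,j} = \varphi^t_\Omega(x_{[1,i]}, y_{[1,j]})$,
$\omega(r) = \|x_{[i-r+1,i]}\|_\Omega$ and $\mathrm{suf}(x_{[1,i]}, y_{[1,j]}) = k$. Define

$D^t_{\Omega,i,j} = \max\{\omega(r) + M^t_{\Omega,i-r-t+1,j-r-t+1} : 1 \le r \le k\}$ if $k \ge 1$
and $D^t_{\Omega,i,j} \equiv 0$ if $k = 0$, where $M^t_{\Omega,a,b} = 0$ whenever $a \le 0$ or
$b \le 0$.

Then

$M^t_{\Omega,i,j} = \varphi^t_\Omega(x_{[1,i]}, y_{[1,j]}) = \max\{M^t_{\Omega,i-1,j}, M^t_{\Omega,i,j-1}, D^t_{\Omega,i,j}\}$.

When either $i$ or $j$ is less than or equal to $t$, the initial conditions needed for the
computation of $\varphi^t_\Omega(x,y)$ are
$\varphi^t_\Omega(x_{[1,i]}, y_{[1,j]}) = \|x_{[1,i]}\|_\Omega$ if and only if $x_{[1,i]}$ is a
substring of $y_{[1,j]}$.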
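The recurrence of Proposition 3 can be sketched as a dynamic program over prefixes. The code below is one possible reading rather than the authors' implementation: it assumes that a final block of length $r$ ending at $(i,j)$ is combined with the best value for the prefixes ending at $i-r-t+1$ and $j-r-t+1$ (so that consecutive blocks satisfy $b_{i+1} - B_i \ge t$), and that out-of-range prefixes contribute 0. The names `phi_t`, `big_phi_t` and `weight` are ours. It is run on the sequences $x = 0,1,2,3,1,3,0,1$ and $y = 3,0,1,3,2,0,3,1$.

```python
def phi_t(x, y, t, weight=lambda s: 1):
    """Max weight of a common t-gap block isomorphic subsequence (t = 1 gives weighted LCS)."""
    m, n = len(x), len(y)
    # M[i][j]: best weight for prefixes x[1..i], y[1..j]; row and column 0 are empty prefixes.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    # S[i][j]: length of the longest common suffix of x[1..i] and y[1..j].
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = max(M[i - 1][j], M[i][j - 1])
            if x[i - 1] == y[j - 1]:
                S[i][j] = S[i - 1][j - 1] + 1
            w = 0
            for r in range(1, S[i][j] + 1):
                w += weight(x[i - r])                 # weight of the last r symbols
                pi, pj = i - r - t + 1, j - r - t + 1  # where the previous block may end
                prev = M[pi][pj] if pi > 0 and pj > 0 else 0
                best = max(best, w + prev)
            M[i][j] = best
    return M[m][n]


def big_phi_t(x, y, t, weight=lambda s: 1):
    """Distance: min(||x||, ||y||) minus phi_t."""
    wx = sum(weight(s) for s in x)
    wy = sum(weight(s) for s in y)
    return min(wx, wy) - phi_t(x, y, t, weight)


x = [0, 1, 2, 3, 1, 3, 0, 1]
y = [3, 0, 1, 3, 2, 0, 3, 1]
print(phi_t(x, y, 2), phi_t(x, y, 3))   # 4 3
```

With unit weights the values 4 and 3 agree with the subsequences $x_{[1,2],[4,5]} = y_{[2,3],[7,8]}$ and $x_{[1,2],[8,8]} = y_{[2,3],[8,8]}$ given for $t = 2$ and $t = 3$. The inner loop over $r$ makes the sketch cubic in the worst case; it is meant for clarity, not speed.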

Example 2. Let $x = 0,1,2,3,1,3,0,1$ and $y = 3,0,1,3,2,0,3,1$. Figure 1 and Figure
2 are $M^2$ and $M^3$ where $\|x_\sigma\| = |\sigma|$. Figure 3 and Figure 4 are $M^2_\Omega$
and $M^3_\Omega$ where $\|x_\sigma\|_\Omega = \sum_{i \in \sigma} (x_i + 1)$. Below each figure,
we give an example of t-gap block isomorphic subsequences with
$\|x_\sigma\| = \|y_\tau\| = \varphi^t(x,y)$ ($\varphi^t_\Omega(x,y)$).

0  1  1  1  1  1  1  1
0  1  2  2  2  2  2  2
0  1  2  2  2  2  2  2
1  1  2  2  2  2  3  3
1  1  2  2  2  2  3  4
1  1  2  2  2  2  3  4
1  2  2  2  2  3  3  4
1  2  3  3  3  3  3  4

Fig. 1:  $x_{[1,2],[4,5]} \cong_2 y_{[2,3],[7,8]}$

Fig. 2 (final row of $M^3$: 1  2  3  3  3  3  3  3):  $x_{[1,2],[8,8]} \cong_3 y_{[2,3],[8,8]}$

Fig. 3:  $x_{[4],[6],[8]} \cong_2 y_{[1],[4],[8]}$

Fig. 4:  $x_{[1],[4,5]} \cong_3 y_{[2],[7,8]}$
4.2  Sequences of t-Strings

In this section, we apply the results of previous sections to sequences of strings of
length $t$ (with particular attention to $t = 2$) that naturally arise from $x \in [q]^n$. The
goal is to then apply these results to the modeling of DNA hybridization distances.
Definition 8. For $\sigma,\tau \prec (n)$ and $1 \le t \le n-1$, let $\sigma^t \prec \sigma$ be
defined by: $i \in \sigma^t \Leftrightarrow [i,i+t-1] \prec \sigma$. If $|\sigma| = |\tau|$, then
for the unique $f : \sigma \to \tau$, we define $\sigma^t_\tau$ as:
$i \in \sigma^t_\tau \Leftrightarrow [i,i+t-1] \prec \sigma$ and $[f(i),f(i)+t-1] \prec \tau$.
We define $\tau^t_\sigma \prec \tau$ by $\tau^t_\sigma = f(\sigma^t_\tau)$.

Example 3. Let $\sigma,\tau \prec (16)$ be given in their block representations with
$\sigma = [1,4],[7,10],[12,15]$ and $\tau = [2,8],[12,16]$. Then $\sigma^2$, $\tau^2$,
$\sigma^2_\tau$ and $\tau^2_\sigma$ are:
$\sigma^2 = [1,3],[7,9],[12,14]$; $\tau^2 = [2,7],[12,15]$;
$\sigma^2_\tau = [1,3],[7,8],[12,14]$; $\tau^2_\sigma = [2,4],[6,7],[13,15]$.

Then $\sigma^3$, $\tau^3$, $\sigma^3_\tau$ and $\tau^3_\sigma$ are:
$\sigma^3 = [1,2],[7,8],[12,13]$; $\tau^3 = [2,6],[12,14]$;
$\sigma^3_\tau = [1,2],[7,7],[12,13]$; $\tau^3_\sigma = [2,3],[6,6],[13,14]$. A careful
inspection of $\sigma^2_\tau$, $\tau^2_\sigma$ and of $\sigma^3_\tau$, $\tau^3_\sigma$
demonstrates the general result that $\sigma^t_\tau \cong_t \tau^t_\sigma$.
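The index sets of Example 3 can be reproduced with a few lines of Python. This is a sketch under the reading that $i \in \sigma^t_\tau$ when $[i,i+t-1] \prec \sigma$ and $[f(i),f(i)+t-1] \prec \tau$, with $\tau^t_\sigma$ taken as the image of $\sigma^t_\tau$ under $f$; the helper names are ours.

```python
def blocks(sigma):
    """Block representation [(b, B), ...] of an increasing index sequence."""
    out = []
    for i in sigma:
        if out and i == out[-1][1] + 1:
            out[-1] = (out[-1][0], i)
        else:
            out.append((i, i))
    return out


def expand(block_rep):
    """Index sequence b, b+1, ..., B for each block (b, B) in order."""
    return [i for b, B in block_rep for i in range(b, B + 1)]


def common_t_strings(sigma, tau, t):
    """Return (sigma^t_tau, tau^t_sigma) for index sequences of equal length."""
    assert len(sigma) == len(tau)
    f = dict(zip(sigma, tau))               # the unique increasing bijection
    s_set, t_set = set(sigma), set(tau)
    st = [i for i in sigma
          if all(i + k in s_set for k in range(t))
          and all(f[i] + k in t_set for k in range(t))]
    return st, [f[i] for i in st]


sigma = expand([(1, 4), (7, 10), (12, 15)])
tau = expand([(2, 8), (12, 16)])
for t in (2, 3):
    st, ts = common_t_strings(sigma, tau, t)
    print(t, blocks(st), blocks(ts))
```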

For $x,y \in [q]^n$ with $x_\sigma = y_\tau$, we have that $i \in \sigma^t_\tau$ if and only if
$i$ is the first index in a common t-string, $x_{[i,i+t-1]} = y_{[f(i),f(i)+t-1]}$, of the common
subsequence $x_\sigma = y_\tau$ of $x$ and $y$. Thus $|\sigma^t_\tau|$ is the number of common
t-strings that occur in the common subsequence $x_\sigma = y_\tau$ of $x$ and $y$. In
particular, $|\sigma^2_\tau|$ is the number of common 2-strings that occur in the common
subsequence $x_\sigma = y_\tau$ of $x$ and $y$. This measure is of interest to us because when
two DNA strands have a secondary structure in a duplex, the thermodynamic weight (e.g., free
energy) of nearest neighbor stacked pairs of that secondary structure is a measure (not the
measure) of the thermodynamic stability of the duplex with the given secondary structure.
Since every secondary structure in the DNA duplex $x : \bar{y}$ between $x$ and the reverse
complement, $\bar{y}$, of $y$ corresponds to a common subsequence, $x_\sigma = y_\tau$, between
$x$ and $y$, we have that $|\sigma^2_\tau|$ gives us the number of nearest neighbor stacked
pairs in the $x : \bar{y}$ duplex with the secondary structure associated with
$x_\sigma = y_\tau$. In general, $|\sigma^t_\tau|$ gives us the number of common t-stems in the
$x : \bar{y}$ duplex with the secondary structure associated with $x_\sigma = y_\tau$. We
now show how we compute the "total weight" of the common 2-strings that occur in the
common subsequence $x_\sigma = y_\tau$ of $x$ and $y$.

Definition 9. Suppose $2 \le t \le n-1$. Given a string $[a,b] \prec (n)$ and $x \in [q]^n$, let
$d_q(x_{[a,b]})$ be the unique number in $[q^{b-a+1}]$ whose q-ary representation is
$x_a x_{a+1} x_{a+2} \cdots x_b$. For $x \in [q]^n$, let $x^{(t)} \in [q^t]^{n-t+1}$ be defined as
$x^{(t)} = (d_q(x_{[i,i+t-1]}))_{i=1}^{n-t+1}$. For example, if
$x = 2,3,3,0,3,0,2,2,1,1,2,0,2 \in [4]^{13}$, then
$x^{(2)} = 11,15,12,3,12,2,10,9,5,6,8,2 \in [16]^{12}$ and
$x^{(3)} = 47,60,51,12,50,10,41,37,22,24,34 \in [64]^{11}$.
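The map from a sequence to its sequence of t-strings is a sliding-window base-$q$ encoding. A minimal sketch (the function name `t_strings` is ours) that reproduces the example of Definition 9:

```python
def t_strings(x, t, q=4):
    """Encode each length-t window of x as the integer with those base-q digits."""
    out = []
    for i in range(len(x) - t + 1):
        value = 0
        for symbol in x[i:i + t]:
            value = value * q + symbol   # append one base-q digit
        out.append(value)
    return out


x = [2, 3, 3, 0, 3, 0, 2, 2, 1, 1, 2, 0, 2]
print(t_strings(x, 2))   # [11, 15, 12, 3, 12, 2, 10, 9, 5, 6, 8, 2]
print(t_strings(x, 3))   # [47, 60, 51, 12, 50, 10, 41, 37, 22, 24, 34]
```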

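Definition 10. Suppose $2 \le t \le n-1$. Let $\Omega$ be a weight function on $[q^t]$. Then
for $x,y \in [q]^n$, we define:
$\psi^t_\Omega(x,y) = \max\{\|x^{(t)}_{\sigma^t_\tau}\|_\Omega : x_\sigma = y_\tau\}$ and
$\Psi^t_\Omega(x,y) = \min(\|x^{(t)}\|_\Omega, \|y^{(t)}\|_\Omega) - \psi^t_\Omega(x,y)$. If
$\|x^{(t)}_\sigma\|_\Omega = |\sigma|$, then we write $\psi^t(x,y)$ and $\Psi^t(x,y)$ for
$\psi^t_\Omega(x,y)$ and $\Psi^t_\Omega(x,y)$ respectively.

Proposition 4. Suppose $2 \le t \le n-1$. Let $\Omega$ be a weight function on $[q^t]$. Then
for $x,y \in [q]^n$, we have:

$\psi^t_\Omega(x,y) = \varphi^t_{\Omega,q^t}(x^{(t)}, y^{(t)})$.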

Definition 11. A q-ary code of $\Omega$ weighted t-stem distance $d$ is a subset $C$ of $[q]^n$
such that: $x \ne y \in C \Rightarrow \Psi^t_\Omega(x,y) \ge d$. If $t = 2$, we call such a $C$
a q-ary code of $\Omega$ weighted stacked pair distance $d$. Thus if $C$ is an $\Omega$
weighted code of t-stem distance $d$, then: $x \ne y \in C \Rightarrow
\|x^{(t)}\|_\Omega - \varphi^t_{\Omega,q^t}(x^{(t)}, y^{(t)}) \ge d$ and
$\|y^{(t)}\|_\Omega - \varphi^t_{\Omega,q^t}(x^{(t)}, y^{(t)}) \ge d$. If
$\|x^{(t)}_\sigma\|_\Omega = |\sigma|$, we simply call such a $C$ a t-stem
code of distance $d$.


From a DNA duplex point of view, with $\Omega = F$ being the thermodynamic
weight of the (virtual) stacked pairs of nucleotides (see Table 1), $\|x^{(2)}\|_F$ is the
absolute value of nearest neighbor free energy of the duplex $x : \bar{x}$. Thus, if $C$ is a
(A,C,G,T) quaternary code of $F$ weighted stacked pair distance $d$, then $x \ne y \in C$
implies that the thermodynamic stability of each of the duplexes $x : \bar{x}$ and
$y : \bar{y}$ is at least "d greater than" the thermodynamic stability of the duplex
$x : \bar{y}$. Moreover, if $C$ is a (A,C,G,T) quaternary t-stem code of distance $d$, then
$x \ne y \in C$ implies $x : \bar{x}$ and $y : \bar{y}$ each have at least "d more" common
t-stems than are in any secondary structure for duplex $x : \bar{y}$. The main point of
application is that $\psi^2_F(x,y)$ is a measure of the nearest neighbor stability of the DNA
duplex $x : \bar{y}$ and $\psi^t(x,y)$ is the maximum number of t-stems that can form in any
secondary structure of $x : \bar{y}$. For a 2-stem code $C$ of distance $d = (n-1) - k$, we
have for $x \ne y \in C$ that the maximum number of stacked pairs in a secondary structure of
the duplex $x : \bar{y}$ is at most $k$ while the number of stacked pairs in each of the
$x : \bar{x}$ and $y : \bar{y}$ duplexes is $n - 1$.


Theorem 2. In $[q]^n$, there is a 2-stem code, $C$, of distance $d = (n-1) - k$ with:


|C|>^fi:ryQ:11)(7): 
5.  Conclusions


5.1  Applications  to  DNA  Hybridization  Distance  Modeling 

Single strands of DNA are, abstractly, (A,C,G,T)-quaternary sequences, with
the four letters denoting the respective nucleotide bases. DNA sequences are oriented; e.g.,
5'AACG3' is distinct from 5'GCAA3', but it is identical to 3'GCAA5'. The
orientation of a DNA strand is usually indicated by the 5' -> 3', 3' -> 5' notation that
reflects the asymmetric covalent linking between consecutive bases in the DNA strand
backbone. In this paper, when we write DNA molecules without indicating the direction,
it is assumed that the direction is 5' -> 3'. Furthermore, DNA is naturally double
stranded. That is, each sequence normally occurs with its reverse complement, with
reversal denoting that two strands are oppositely directed, and with complementarity
denoting that the allowed pairings of letters, opposing one another on the two strands, are
{A,T} or {G,C}, the canonical Watson-Crick base pairings. Therefore, to obtain the
reverse complement of a strand of DNA, first reverse the order of the letters and then
substitute each letter with its complement. For example, the reverse complement of
AACGTG is CACGTT. If $x$ = AACGTG, then we let $\bar{x}$ denote its reverse
complement CACGTT. We let $\overleftarrow{x}$ denote $x$ listed in reverse 3' -> 5' order. As
DNA sequences, $x$ and $\overleftarrow{x}$ are identical, i.e.,
$x = 5'x_1,\ldots,x_n3' = \overleftarrow{x} = 3'x_n,\ldots,x_1 5'$.
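The reverse complement operation described above is easy to state in code. A minimal sketch (the names `COMPLEMENT` and `reverse_complement` are ours), checked against the AACGTG example:

```python
# Watson-Crick pairing of the four bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(strand):
    """Reverse a 5'->3' strand and replace each base by its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("AACGTG"))   # CACGTT
```

Applying the function twice returns the original strand, which reflects the fact that the two strands of a Watson-Crick duplex are reverse complements of each other.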

A Watson-Crick (WC) duplex results from joining reverse complement sequences
in opposite orientations, e.g.,

$x : \bar{x}$  =  5'AACGTG3'
               3'TTGCAC5'.

Whenever two, not necessarily complementary, oppositely directed DNA strands
"mirror" one another sufficiently, they are capable of coalescing into a DNA duplex. The
process of forming DNA duplexes from single strands is referred to as DNA
hybridization. The greatest energy of duplex formation is obtained when the two
sequences are reverse complements of one another and the DNA duplex formed is a WC
duplex. However, there are many instances when the formation of non-WC duplexes is
energetically favorable. In this paper, a non-WC duplex is referred to as a
cross-hybridized (CH) duplex.

An n-DNA code is a collection of single stranded DNA sequences of length $n$. In
DNA hybridization assays, the general rule is that formation of WC duplexes is good,
while the formation of CH duplexes is bad. A primary goal of DNA code design is to be
assured that a fixed temperature can be found that is well above the melting point of all
CH duplexes and well below the melting point of all WC duplexes that can form from strands
in the code. (It is also desirable for all WC duplexes to have melting points in a narrow
range.) Thus the formation of any WC duplex must be significantly more energetically
favorable than all possible CH duplexes. A DNA code with this property is said to have
high binding specificity. High binding specificity is akin to a high signal-to-noise ratio.

A natural simplification for formulating binding specificity is to base it upon the
maximum number of WC (inter-strand, non-covalent hydrogen) base pair bonds between
complementary letter pairs which may be formed between two oppositely directed
strands. Then for $x,y \in C$, an upper bound on this maximum number of base pair bonds
that can form in the $x : \bar{y}$ duplex is the maximum length of a common subsequence to
$x$ and $y$. In short, two single stranded DNA sequences $x$ and $\bar{y}$ of length $n$ can
form $d$ base pair bonds in a duplex only if $\Phi^1(x,y) \le n - d$. This doesn't mean that
$x$ and $\bar{y}$ will form $d$ base pair bonds in a hybridization assay; it just says they
could never form more than $n - \Phi^1(x,y)$ base pair bonds.

If the binding specificity were solely dependent on the number of base pair bonds,
then n-DNA codes constructed by using $\Phi^1(x,y)$ as the distance function could be used
in hybridization assays with assured high binding specificity. This is because if $n - d$ is
large enough, then one could find a temperature that exceeds the $d$ base pair bonding
threshold of all $x : \bar{y}$ CH duplexes, but is below the melting point of each
$x : \bar{x}$ WC duplex in which $n$ base pair bonds form.

However, while the melting point of DNA duplexes depends, in part, on the
number of base pair bonds, the state of the art model of DNA duplex thermodynamics is
the Nearest Neighbor (NN) model. In the NN model, thermodynamic (e.g., free energy)
values are assigned to loops rather than base pairs. We now briefly discuss some key
aspects of the NN model.

Consider two oppositely directed DNA strands $x = 5'x_1,x_2,\ldots,x_i,\ldots,x_n3'$ and
$\bar{y} = 3'\bar{y}_1,\bar{y}_2,\ldots,\bar{y}_j,\ldots,\bar{y}_n5'$ where $\bar{y}_j$ denotes
the complement to base $y_j$. A secondary structure of the DNA duplex $x : \bar{y}$ is a
sequence of pairs of complementary bases $((x_{i_r}, \bar{y}_{j_r}))$ where $(x_{i_r})$ and
$(\bar{y}_{j_r})$ are subsequences of $x$ and $\bar{y}$ respectively. Clearly the duplex
$x : \bar{y}$ can have many secondary structures. An important issue is to
understand which secondary structure is the most energetically favorable.

The collection of complementary pairs in a given secondary structure of a duplex
partitions the duplex into pairs of substrings (or subduplexes)
that have the $(x_{i_r}, \bar{y}_{j_r})$ and $(x_{i_{r+1}}, \bar{y}_{j_{r+1}})$ as endpoints. For
example, in the $x : \bar{y}$ duplex presented as:

$5'x_1,x_2,\ldots,x_{i_1}3' \;*\; \cdots \;*\; 5'x_{i_r},\ldots,x_{i_{r+1}}3' \;*\; \cdots$

$3'\bar{y}_1,\bar{y}_2,\ldots,\bar{y}_{j_1}5' \;*\; \cdots \;*\; 3'\bar{y}_{j_r},\ldots,\bar{y}_{j_{r+1}}5' \;*\; \cdots$

each pair

$5'x_{i_r},\ldots,x_{i_{r+1}}3'$

$3'\bar{y}_{j_r},\ldots,\bar{y}_{j_{r+1}}5'$

of substrings (separated by *) is an elementary substructure called a loop of the given
secondary structure $((x_{i_r}, \bar{y}_{j_r}))$ of the given duplex $x : \bar{y}$. If each of
the strings in a loop is of length 2, e.g.,

$5'x_{i_r},x_{i_r+1}3'$
$3'\bar{y}_{j_r},\bar{y}_{j_r+1}5'$,

then that loop is called a stacked pair.

Example 4. We use a mix of lower case and upper case letters to help identify the
secondary structure. Consider the duplex

5'ggCaTaTcatACt3'

3'TccAAttGgtaGa5'

where the secondary structure is (g,c), (g,c), (a,t), (a,t), (c,g), (a,t), (t,a), (t,a). The
loops, listed left to right, are:

g  * gg * gCa  * aTa * aTc * ca * at * tACt

Tc * cc * cAAt * tt  * tGg * gt * ta * aGa

$E = E_1 * E_2 * E_3 * E_4 * E_5 * E_6 * E_7 * E_8$.

The free energy, $\Delta G$, of the duplex predicted by the NN model is approximately
$\sum_{i=1}^{8} \Delta G_i$, where $\Delta G_i$ is the free energy assigned to loop $E_i$.
However, in many cases, the most stabilizing features of the structure come from the stacked
pairs, i.e., $E_2$, $E_6$, and $E_7$, and the free energies of stacked pairs are the most
accurately measured. See [].
The free energies for most non-stacked loops are approximated from the free energy for
stacked pairs with the same terminal pairs. For example, the free energy of

$E_3$ = 5'gCa3'
        3'cAAt5'

would be approximated by adding a "penalty" to the measured free energy for the stacked pair

5'ga3'
3'ct5'

(that does not appear in the above secondary structure). In most cases, the penalty takes on
a positive value while all of the free energies for stacked pairs are negative. It is
therefore reasonable to assume that if one only considers the free energies for the stacked
pairs, then their sum would be a lower bound for the NN free energy for the given duplex
with the given secondary structure.

Consider two identically directed DNA strands $x = 5'x_1,x_2,\ldots,x_i,\ldots,x_n3'$ and
$y = 5'y_1,y_2,\ldots,y_j,\ldots,y_n3'$. For computational purposes, we define the idea of a
virtual secondary structure between these two identically directed strands even though no
such structure would naturally form. A virtual secondary structure of the virtual DNA duplex
$x : y$ is a sequence of pairs of identical bases $((x_{i_r}, y_{j_r}))$ where $(x_{i_r})$ and
$(y_{j_r})$ are subsequences of $x$ and $y$ respectively. In other words, a virtual secondary
structure of the virtual duplex $x : y$ is a common subsequence $x_\sigma = y_\tau$ of $x$ and
$y$. Then the virtual duplex $x : y$ has the virtual secondary structure
$((x_{i_r}, y_{j_r}))$ if and only if the actual duplex $x : \bar{y}$ (where
$x = 5'x_1,x_2,\ldots,x_i,\ldots,x_n3'$ and
$\bar{y} = 3'\bar{y}_1,\bar{y}_2,\ldots,\bar{y}_j,\ldots,\bar{y}_n5'$) has the actual secondary
structure of pairs of complementary bases $((x_{i_r}, \bar{y}_{j_r}))$ where $(x_{i_r})$ and
$(\bar{y}_{j_r})$ are subsequences of $x$ and $\bar{y}$ respectively. A stacked pair,

$5'x_{i_r},x_{i_r+1}3'$
$3'\bar{y}_{j_r},\bar{y}_{j_r+1}5'$,

exists in the actual secondary structure $((x_{i_r}, \bar{y}_{j_r}))$ if and only if the
corresponding virtual stacked pair,

$5'x_{i_r},x_{i_r+1}3'$
$5'y_{j_r},y_{j_r+1}3'$,

exists in the virtual secondary structure of the virtual duplex $x : y$. Thus, there exists a
virtual stacked pair in a virtual secondary structure $x_\sigma = y_\tau$ if and only if
$(x_i, x_{i+1}) = (y_{f(i)}, y_{f(i)+1})$ is a common 2-string of the common subsequence
$x_\sigma = y_\tau$ where $f : \sigma \to \tau$ is unique.

Identifying virtual stacked pairs with their natural representation, the virtual free
energy ($F$) values can be associated to the negative of their corresponding values
($\Delta G$) for actual stacked pairs. The actual values are given in kcal/mol measured at
37 C and with specified ionic concentrations. Table 1 gives the values with their
corresponding virtual stacked pairs. Since the virtual stacked pair is a pair of identical
2-strings $(x_i, x_{i+1}) = (y_{f(i)}, y_{f(i)+1})$, we can represent this virtual stacked pair
by $(x_i, x_{i+1})$ and denote its virtual free energy by $F(x_i, x_{i+1})$. The $(i,j)$th
entry of Figure 5 is the value of $F(i,j)$, e.g., $F(C,T) = 1.28$. (So $F(C,T)$ denotes the
free energy associated with the naturally occurring stacked pair

5'CT3'
3'GA5'.)

We take $F$ as our weight function on $[4^2]$.


F      A      C      G      T
A    1.00   1.44   1.28   0.88
C    1.45   1.84   2.17   1.28
G    1.30   2.24   1.84   1.44
T    0.58   1.30   1.45   1.00

Figure 5.  Thermodynamic weight of virtual stacked pairs.
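To make the use of Figure 5 concrete, the following sketch stores the table as a dictionary keyed by 2-strings (row base first, column base second) and sums the weights of all consecutive 2-strings of a strand, which is the quantity $\|x^{(2)}\|_F$. The names `F` and `stacked_pair_weight` are ours; the two strands are AATCCAACATTATTGC and GTCACATCATCAAGCC.

```python
# Figure 5: F[first base + second base], values in kcal/mol (sign dropped).
F = {
    "AA": 1.00, "AC": 1.44, "AG": 1.28, "AT": 0.88,
    "CA": 1.45, "CC": 1.84, "CG": 2.17, "CT": 1.28,
    "GA": 1.30, "GC": 2.24, "GG": 1.84, "GT": 1.44,
    "TA": 0.58, "TC": 1.30, "TG": 1.45, "TT": 1.00,
}


def stacked_pair_weight(strand):
    """Sum of F over all consecutive 2-strings of the strand."""
    return sum(F[strand[i:i + 2]] for i in range(len(strand) - 1))


print(round(stacked_pair_weight("AATCCAACATTATTGC"), 2))   # 18.39
print(round(stacked_pair_weight("GTCACATCATCAAGCC"), 2))   # 20.7
```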


Let $x : \bar{y}$ be an actual duplex and let $\Delta G(x : \bar{y})$ be the NN computation of
the free energy of the $x : \bar{y}$ duplex. The main point of all of this is that it is quite
reasonable to assume that in most cases: $-\psi^2_F(x,y) \le \Delta G(x : \bar{y})$. From a DNA
duplex point of view, with $F$ being the thermodynamic weight of the virtual stacked pairs of
nucleotides, we have: $\|x^{(2)}\|_F = -\Delta G(x : \bar{x})$. Thus if $C$ is an $F$ weighted
stacked pair (A,C,G,T) quaternary code of distance $d$, then:
$x \ne y \in C \Rightarrow \Psi^2_F(x,y) \ge d$.
This implies that the thermodynamic stability, $-\Delta G(x : \bar{x})$ and
$-\Delta G(y : \bar{y})$, of each (all) of the WC duplexes $x : \bar{x}$ and $y : \bar{y}$,
respectively, would each be at least "d greater than" the thermodynamic stability,
$-\Delta G(x : \bar{y})$, of the (any) $x : \bar{y}$ CH duplex where $x \ne y \in C$. Thus,
n-DNA codes closed under complementation ($x \in C \Leftrightarrow \bar{x} \in C$) constructed
by using $\Psi^2_F$ as the distance function could be used in hybridization assays with high
binding specificity.


Example 5. Given DNA sequences x = GTTATAGGCCGAG and y = CGTCGTGTATATT
of length 13, consider the virtual secondary structure $x_\sigma = y_\tau$ with
$\sigma = [1,6]$ and $\tau = [5,6],[8,11]$. We have that $\sigma^2_\tau = [1,1],[3,5]$ and
$\tau^2_\sigma = [5,5],[8,10]$. We use lower case letters to exhibit the common subsequences
that represent the virtual secondary structure $x_\sigma = y_\tau$:

gttataGGCCGAG

CGTCgtGtataTT

Identify 0 = A, 1 = C, 2 = G and 3 = T and convert the DNA sequences
accordingly. Then $x^{(2)} = 11,15,12,3,12,2,10,9,5,6,8,2$ and
$y^{(2)} = 6,11,13,6,11,14,11,12,3,12,3,15$ where the block isomorphic subsequence,
$x^{(2)}_{\sigma^2_\tau} \cong y^{(2)}_{\tau^2_\sigma}$, represents the four virtual stacked
pairs gt, ta, at, ta in the displayed virtual secondary structure
$x_\sigma$ = gttata = $y_\tau$. Using the $F$ in Figure 5, we have that
$\|x^{(2)}_{\sigma^2_\tau}\|_F = 1.44 + 0.58 + 0.88 + 0.58 = 3.48$. However,
$\psi^2_F(x,y) = 3.61$. This is because the virtual secondary structure $x_\alpha = y_\beta$
with $\alpha = [1,2],[10,11],[13,13]$ and $\beta = [2,5],[7,7]$, depicted using lower case
letters as:

gtTATAGGCcgAg

CgtcgTgTATATT

has $\alpha^2_\beta = [1,1],[10,10]$ and $\beta^2_\alpha = [2,2],[4,4]$. Then
$x^{(2)}_{\alpha^2_\beta} \cong y^{(2)}_{\beta^2_\alpha}$ represents the virtual stacked pairs
gt and cg in the virtual secondary structure $x_\alpha$ = gtcgg = $y_\beta$. Finally, we have
that $\|x^{(2)}_{\alpha^2_\beta}\|_F = 1.44 + 2.17 = 3.61$.

Example 6.  Given DNA sequences x = AATCCAACATTATTGC and y =
GTCACATCATCAAGCC and using the F in Figure 5, we have $\|x^{(2)}\|_F = 18.39$,
$\|y^{(2)}\|_F = 20.7$ and $\psi^F(x,y) = 8.19$. Thus $\|x^{(2)}\|_F - \psi^F(x,y) = 10.20$.
We also have that

$x^{(2)}$ = 0,3,13,5,4,0,1,4,3,15,12,3,15,14,9;
$y^{(2)}$ = 11,13,4,1,4,3,13,4,3,13,4,0,2,9,5.

There are at most six stacked pairs in any virtual secondary structure between x and
y, i.e., $\psi_2(x,y) = \phi(x^{(2)}, y^{(2)}) = 6$. A virtual secondary structure that has six
stacked pairs is $x_\sigma = x_{[3,4],[7,10],[12,13],[15,16]} = y_{[2,7],[9,10],[14,15]} = y_\tau$.
These six stacked pairs are represented by the common block isomorphic subsequence

$x^{(2)}_{\sigma_2} = x^{(2)}_{[3,3],[7,9],[12,12],[15,15]} = y^{(2)}_{[2,2],[4,6],[9,9],[14,14]} = y^{(2)}_{\tau_2}$.

In this case,

$\psi^F(x,y) = \phi^F(x^{(2)}, y^{(2)}) = \|x^{(2)}_{[3,3],[7,9],[12,12],[15,15]}\|_F = 8.19$.

We also have that

$x^{(3)}$ = 2,9,36,20,16,1,4,18,10,39,33,10,42,45
$y^{(3)}$ = 57,35,17,4,18,9,35,18,9,35,16,3,13,53.

Since $\psi_3(x,y) = \phi(x^{(3)}, y^{(3)}) = 2$, the maximum number of 3-stems in any
virtual secondary structure between x and y is 2. Note that the virtual secondary structure
$x_\sigma = x_{[3,4],[7,10],[12,13],[15,16]} = y_{[2,7],[9,10],[14,15]} = y_\tau$ has exactly two
3-stems, namely $x_{[7,9]} = y_{[4,6]}$ = ACA and $x_{[8,10]} = y_{[5,7]}$ = CAT, which
correspond to $x^{(3)}_{[7,8]} = y^{(3)}_{[4,5]}$ = 4,18.
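The stem counts quoted in Example 6 can be reproduced with a small dynamic program. The sketch below is an assumption about one way to compute the maximum number of t-stems over all common subsequences of x and y; it is not the algorithm used by the software in this report, and the name `max_t_stems` is hypothetical. A t-stem is taken to be a run of t paired positions that are consecutive in both strands.

```python
def max_t_stems(x, y, t):
    """Maximum number of t-stems in any common subsequence of x and y."""
    n, m = len(x), len(y)
    NEG = float("-inf")
    # best[i][j]: best count over structures using only x[:i] and y[:j].
    best = [[0] * (m + 1) for _ in range(n + 1)]
    # run[i][j][r]: best count when x[i-1] is paired with y[j-1] and the
    # current run of consecutive pairs has length r (lengths >= t stored as t).
    run = [[[NEG] * (t + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                cell = run[i][j]
                # Start a new run after any structure on the shorter prefixes.
                cell[1] = best[i - 1][j - 1] + (1 if t == 1 else 0)
                # Extend a run that ends at the pair (i-1, j-1).
                for r in range(1, t + 1):
                    prev = run[i - 1][j - 1][r]
                    if prev == NEG:
                        continue
                    new_r = min(r + 1, t)
                    gain = 1 if r + 1 >= t else 0  # one more t-stem completed
                    cell[new_r] = max(cell[new_r], prev + gain)
            best[i][j] = max(best[i - 1][j], best[i][j - 1], max(run[i][j]))
    return best[n][m]

x = "AATCCAACATTATTGC"
y = "GTCACATCATCAAGCC"
print(max_t_stems(x, y, 2))  # 6 stacked pairs
print(max_t_stems(x, y, 3))  # 2 three-stems
```

The running time is proportional to |x| |y| t, so codeword pairs of the lengths used in this report are checked quickly. The sketch counts stems only; the free energy weighted value depends on the F in Figure 5.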


5.2  Biomolecular  Computing  Architecture 

A DNA bit string of length N is a DNA molecule (single long strand) that consists
of N distinct nonoverlapping substrands. Suppose we have a DNA code C of size 4N
partitioned into coding strands (2N) and probe strands (2N). For example, consider the
DNA code below. It has twenty codeword strands, each 12 bases long. Ten of these are
labelled $T_i$ or $F_i$ and ten are labelled Probe($T_i$) or Probe($F_i$). Probe(X) is the WC
complement of X. This allows us to code and read 32 DNA bit strings. The DNA
library has 32 longer strands of 60 bases of the form $X_1X_2X_3X_4X_5$ where
$X_i = T_i$ or $F_i$ as given below. See Figure 6.


DNA Computing Strand Engineering

Maximum CH free energy parameter: 5
Nearest neighbor WC free energy LOWER BOUND = 10
Nearest neighbor WC free energy UPPER BOUND = 13

DNA CODE

AAAAAAAACC = T1      GGTTTTTTTT = BEAD PROBE (T1)
TTTCCAAAAA = F1      TTTTTGGAAA = BEAD PROBE (F1)
TTTCTTAACC = T2      GGTTAAGAAA = BEAD PROBE (T2)
ACTAACAAAA = F2      TTTTGTTAGT = BEAD PROBE (F2)
CATAAAACAC = T3      GTGTTTTATG = BEAD PROBE (T3)
ATCTTTTCAA = F3      TTGAAAAGAT = BEAD PROBE (F3)
CAATCCATTA = T4      TAATGGATTG = BEAD PROBE (T4)
CCTTCTAAAT = F4      ATTTAGAAGG = BEAD PROBE (F4)
ACTCCTAATA = T5      TATTAGGAGT = BEAD PROBE (T5)
TCTCTCTACT = F5      AGTAGAGAGA = BEAD PROBE (F5)

DNA LIBRARY = DNA BITSTRINGS

 1.  AAAAAAAACC-TTTCTTAACC-CATAAAACAC-T4-T5
 2.  AAAAAAAACC-TTTCTTAACC-CATAAAACAC-T4-F5
 3.  AAAAAAAACC-TTTCTTAACC-CATAAAACAC-F4-T5
 4.  AAAAAAAACC-TTTCTTAACC-CATAAAACAC-F4-F5
 5.  AAAAAAAACC-TTTCTTAACC-ATCTTTTCAA-T4-T5
 6.  AAAAAAAACC-TTTCTTAACC-ATCTTTTCAA-T4-F5
 7.  AAAAAAAACC-TTTCTTAACC-ATCTTTTCAA-F4-T5
 8.  AAAAAAAACC-TTTCTTAACC-ATCTTTTCAA-F4-F5
 9.  AAAAAAAACC-ACTAACAAAA-CATAAAACAC-T4-T5
10.  AAAAAAAACC-ACTAACAAAA-CATAAAACAC-T4-F5
11.  AAAAAAAACC-ACTAACAAAA-CATAAAACAC-F4-T5
12.  AAAAAAAACC-ACTAACAAAA-CATAAAACAC-F4-F5
13.  AAAAAAAACC-ACTAACAAAA-ATCTTTTCAA-T4-T5
...
28.  TTTCCAAAAA-ACTAACAAAA-CATAAAACAC-F4-F5
29.  TTTCCAAAAA-ACTAACAAAA-ATCTTTTCAA-T4-T5
30.  TTTCCAAAAA-ACTAACAAAA-ATCTTTTCAA-T4-F5
31.  TTTCCAAAAA-ACTAACAAAA-ATCTTTTCAA-F4-T5
32.  TTTCCAAAAA-ACTAACAAAA-ATCTTTTCAA-F4-F5

Figure 6.  DNA Strand Engineering

As indicated above, we identify DNA bit strings and binary sequences. For
$I \subseteq [N]$ and $(e_i)_{i \in I}$ a binary sequence, let K be the following subset of binary
N-sequences: $K = \{(b_i) : b_i = e_i \text{ for some } i \in I\}$. K is the set of all binary
sequences that satisfy the disjunctive clause K' over N Boolean terms, each of which is a
variable $x_i$ (if $e_i = 1$) or its negation $\neg x_i$ (if $e_i = 0$). The main "computing" idea is an
iteration of the following: Given a subset T of DNA bit strings and a set K defined above,
the subset $T \cap K$ can be extracted from the set T by hybridization.
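The extraction step can be modelled abstractly. The Python sketch below is only a software analogue under stated assumptions: bit strings are tuples of 0 and 1, a clause is a dictionary mapping a 1-based index i to the required value of that bit, and the hybridization step is replaced by a simple membership test. The names `satisfies` and `extract` are hypothetical.

```python
from itertools import product

def satisfies(bits, clause):
    """True if bits[i] equals the required value for at least one index i."""
    return any(bits[i - 1] == e for i, e in clause.items())

def extract(strings, clause):
    """Software stand-in for one hybridization step: keep T intersect K."""
    return [b for b in strings if satisfies(b, clause)]

N = 3
T = list(product((0, 1), repeat=N))  # all binary N-sequences
clause = {1: 1, 3: 0}                # the clause x1 OR (NOT x3)
kept = extract(T, clause)

print(len(T), len(kept))  # 8 6
```

Only the two sequences with first bit 0 and third bit 1 are removed, which is exactly the set of assignments that falsify the clause.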

We  now  discuss  a  problem  that  is  of  particular  interest  to  us. 
Problem 1.  Let $P_1, P_2, \ldots, P_m$ be fixed subsets of [N].

a.  Find all $S \subseteq [N]$ with $S \not\subseteq P_i$ for all i with $1 \le i \le m$.

b.  Find all $T \subseteq [N]$ with $P_i \not\subseteq T$ for all i with $1 \le i \le m$.

Both  of  these  problems  are  related  and  are  simplified  forms  of  the  general  SAT 
problem.  They  can  be  solved  by  the  method  described  above.  (These  are  simplifications 
because  no  negations  appear  in  the  clauses.) 

There  is  one  important  difference.  In  the  SAT  problem,  only  one  solution  needs  to 
be  found.  Here  all  solutions  are  required. 

Let $(b_j)$ be a binary N-sequence. As above, let
$K_i = \{(b_j) : b_j = 1 \text{ for some } j \notin P_i\}$. Clearly the set of all S with
$S \not\subseteq P_i$ for all i with $1 \le i \le m$ is the set of all S with incidence vector in
$\bigcap_{i=1}^{m} K_i$. In the DNA bit string representation,
$K_i = \{(b_j) : b_j = T_j \text{ for some } j \notin P_i\}$. The associated
filter $F_i$ consists of $\{T_j : j \notin P_i\}$. If a set S of DNA bit strings of length N is passed
through $F_i$, then only the bit strings in $K_i$ remain in the gel of $F_i$. Starting with all
possible DNA bit strings and iterating the filter process outlined above m times, we arrive
at $F_m$. $F_m$ contains all the DNA bit string representations of the solutions to Problem 1a.
Problem 1b can be transformed into Problem 1a because $P_i \not\subseteq T$ if and only if
$[N] - T \not\subseteq [N] - P_i$.

The most straightforward application of the above problem is in the identification
of independent sets in a graph. If one takes all the edges of a simple graph G as the
collection $\{P_i\}$, then the set of all T that solve Problem 1b is the collection of
independent sets in G. See Figures 7-10.
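The iterated filtering can be simulated for the five-vertex graph of Figure 7, whose edges are {1,2}, {1,4}, {2,3}, {2,5}, {3,4}, {4,5}. The Python sketch below is an abstract model, not the laboratory protocol: each filter is assumed to keep exactly the bit strings in which at least one endpoint of its edge is absent, and the name `edge_filter` is hypothetical.

```python
from itertools import product

N = 5
EDGES = [(1, 2), (1, 4), (2, 3), (2, 5), (3, 4), (4, 5)]  # edges of G in Figure 7

def edge_filter(strings, edge):
    """Keep the bit strings that do not contain both endpoints of the edge."""
    u, v = edge
    return [b for b in strings if b[u - 1] == 0 or b[v - 1] == 0]

pool = list(product((0, 1), repeat=N))  # the full library of 2^N bit strings
for edge in EDGES:                      # one filter per edge, applied in turn
    pool = edge_filter(pool, edge)

independent_sets = [{k + 1 for k, bit in enumerate(b) if bit} for b in pool]
independent_sets.sort(key=lambda s: (len(s), sorted(s)))
for s in independent_sets:
    print(sorted(s))
```

For this graph the final pool holds 11 sets: the empty set, the five single vertices, the pairs {1,3}, {1,5}, {2,4}, {3,5}, and the triple {1,3,5}. Running the same loop over the edges of the complement graph would return the cliques of G instead.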
DNA Computing for Independent Sets

Let $Q_1, Q_2, \ldots, Q_k$ be fixed subsets of {1,2,...,n}.

a.  Find all subsets $S \subseteq \{1,2,\ldots,n\}$ with $S \not\subseteq Q_i$ for i with $1 \le i \le k$.
b.  Find all subsets $T \subseteq \{1,2,\ldots,n\}$ with $Q_i \not\subseteq T$ for i with $1 \le i \le k$.

[Drawings of the graph G and its complement G' on the vertices 1, ..., 5]

Let {1,2},{1,4},{2,3},{2,5},{3,4},{4,5} = $Q_1, \ldots, Q_6$ be fixed subsets of {1,2,...,5}.

Finding all subsets $T \subseteq \{1,2,\ldots,5\}$ with $Q_i \not\subseteq T$ for i with $1 \le i \le 6$ is finding all
independent sets in G or all cliques in the complement G'.


Figure 7.  Independent Sets Problem


The filter above gives all sets of vertices that do not contain edge {1,2}.
If  this  process  is  iterated  (as  in  Figure  8),  all  independent  sets  (or 
cliques)  will  be  identified. 
DNA Library


Figure  9.  Filters  for  Edges  of  Graph  in  Figure  7. 

Outflow  from  previous  (red  or  black,  but  not  both)  filter  is  passed  on  to 
the  next  filter  of  the  same  color.  The  final  outflow  is  the  set  of 
molecules  that  represent  independent  sets  (black)  or  cliques  (red).  This 
system  is  for  the  graph(s)  in  Figure  7.  A  universal  system  is  described 
in  Figure  10. 


Universal DNA Computer for any Graph on n Vertices

Every graph G on n vertices has G union G' = all possible pairs on
n vertices. This enables the construction of a universal device.

Each possible edge is a filter. Then, depending on the problem, the flow
is directed by the edges present (or absent) in the given graph.

[Array of filters labelled by vertex pairs, ending with {n-2,n} and {n-1,n}]

Edges in G ON, Edges in G' OFF => Independent Sets in G
Edges in G OFF, Edges in G' ON => Cliques in G


Figure 10.  Universal DNA Independent Set Computer
5.3  t-Stem DNA Code Generation Software


We describe a program which we make freely available. The program(s)
generate DNA codes. Some of the inputs are:

1.  Length of DNA codewords: n;
2.  Stem sizes checked: $t_1, t_2, \ldots$;
3.  Corresponding thresholds for each stem size: $s_1, s_2, \ldots$;
4.  Maximum CH free energy parameter: $\Delta G_{CH}$;
5.  Nearest neighbor WC free energy lower bound parameter: $\underline{\Delta G}_{WC}$;
6.  Nearest neighbor WC free energy upper bound parameter: $\overline{\Delta G}_{WC}$.

What is generated is a DNA code C such that:

1.  $x \in C \Rightarrow |x| = n$ and $\bar{x} \in C$. Thus the WC complement of each strand in the
code is also in the code.

2.  $x \neq y \in C \Rightarrow \psi_{t_i}(x,y) \le s_i$. Thus the maximum number of $t_i$-stems in each CH
duplex from C is at most $s_i$.

3.  $x \neq y \in C \Rightarrow \psi^F(x,y) \le \Delta G_{CH}$. Thus each CH duplex in C has a free energy of
formation above $-\Delta G_{CH}$.

4.  $x \in C \Rightarrow \underline{\Delta G}_{WC} \le \|x^{(2)}\|_F \le \overline{\Delta G}_{WC}$. Thus each WC duplex in C has a free energy
of formation between $-\overline{\Delta G}_{WC}$ and $-\underline{\Delta G}_{WC}$.

Example 7.  Below is a DNA code generated by one of our programs with the inputs:

n = 16; $t_1, t_2, t_3$ = 1, 2, 3; $s_1, s_2, s_3$ = 10, 6, 2; $\Delta G_{CH}$ = 8;
$\underline{\Delta G}_{WC}$ = 18 (lower bound); $\overline{\Delta G}_{WC}$ = 22 (upper bound).

No codeword contains GGG or CCC as a substring. The complement of any strand is
either to the immediate right or left of the given strand. There are 30 codewords in the
code below.
GGCCAAAAAAAAAAAA,  TTTTTTTTTTTTGGCC,  GGCAAAGGTTTTCCAA,
TTGGAAAACCTTTGCC,  CATTTTAAGGAACCGG,  CCGGTTCCTTAAAATG,
TCCTCTTTCTTTACCA,  TGGTAAAGAAAGAGGA,  TAGAATCCGTCAATTT,
AAATTGACGGATTCTA,  GGTTACGGTGGTGTTT,  AAACACCACCGTAACC,
TTTGTCACTTGTGGAG,  CTCCACAAGTGACAAA,  AGTATTTCGATCTTCC,
GGAAGATCGAAATACT,  CAGGCGTTGATGAACA,  TGTTCATCAACGCCTG,
TAACTATGTAGCATGG,  CCATGCTACATAGTTA,  CAACAATAGGAGGCTT,
AAGCCTCCTATTGTTG,  GGACTTAGGCAGACGT,  ACGTCTGCCTAAGTCC,
GAGCGAGGTAGATTAG,  CTAATCTACCTCGCTC,  GATACACACGGCATAT,
ATATGCCGTGTGTATC,  CGAGTGGCTCTCTCAT,  ATGAGAGAGCCACTCG.
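The purely combinatorial properties stated for the code of Example 7 can be checked with a few lines of Python. This is a sketch written for illustration, not the code generation program itself; `wc_complement` is a hypothetical helper that returns the reverse complement of a strand. The stem thresholds and free energy bounds are not checked here, since they depend on the metrics and on the F in Figure 5.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def wc_complement(strand):
    """Watson-Crick complement: reverse the strand and swap A/T and C/G."""
    return "".join(PAIR[b] for b in reversed(strand))

CODE = """
GGCCAAAAAAAAAAAA TTTTTTTTTTTTGGCC GGCAAAGGTTTTCCAA
TTGGAAAACCTTTGCC CATTTTAAGGAACCGG CCGGTTCCTTAAAATG
TCCTCTTTCTTTACCA TGGTAAAGAAAGAGGA TAGAATCCGTCAATTT
AAATTGACGGATTCTA GGTTACGGTGGTGTTT AAACACCACCGTAACC
TTTGTCACTTGTGGAG CTCCACAAGTGACAAA AGTATTTCGATCTTCC
GGAAGATCGAAATACT CAGGCGTTGATGAACA TGTTCATCAACGCCTG
TAACTATGTAGCATGG CCATGCTACATAGTTA CAACAATAGGAGGCTT
AAGCCTCCTATTGTTG GGACTTAGGCAGACGT ACGTCTGCCTAAGTCC
GAGCGAGGTAGATTAG CTAATCTACCTCGCTC GATACACACGGCATAT
ATATGCCGTGTGTATC CGAGTGGCTCTCTCAT ATGAGAGAGCCACTCG
""".split()

# 30 codewords, each of length n = 16.
size_ok = len(CODE) == 30 and all(len(w) == 16 for w in CODE)
# Closed under complementation.
closed = all(wc_complement(w) in CODE for w in CODE)
# Each strand is listed next to its complement (positions 1-2, 3-4, ...).
adjacent = all(CODE[k + 1] == wc_complement(CODE[k]) for k in range(0, 30, 2))
# No codeword contains GGG or CCC.
no_triples = all("GGG" not in w and "CCC" not in w for w in CODE)

print(size_ok, closed, adjacent, no_triples)  # True True True True
```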

To further minimize errors in the applications, further constraints on the code were
considered. Below is a DNA code generated by one of our programs with the inputs:

n = 12; $t_1$ = 2; $s_1$ = 6; $\Delta G_{CH}$ = 7; $\underline{\Delta G}_{WC}$ = 14 (lower bound);
$\overline{\Delta G}_{WC}$ = 17 (upper bound). No codeword contains
GGG or CCC as a substring. In addition, in any given WC pair of DNA codewords, only
one strand contains a G. This is achieved by selecting "ACT-AGT only". This strand will
be used as the probe strand. Thus the WC complement of strand X is listed as Probe(X)
below. Moreover, with the addition of the coupling constraint we also ensure that
sequences that are formed in the middle of the junctions of any library strand,
$T_iT_{i+1}$, $T_iF_{i+1}$, $F_iT_{i+1}$, $F_iF_{i+1}$, all obey the code constraints. This is to
ensure that probes do not hybridize at locations where code strands are ligated into
library strands. See Figure 11.
Cooperative DNA Codes with Coupling
(Prevents probes from bonding across ligated junctions)

CCAACCAAAAAA = T1      TTTTTTGGTTGG = Probe(T1)
AAAAAAACCACC = F1      GGTGGTTTTTTT = Probe(F1)
TTTTTCCTTCCA = T2      TGGAAGGAAAAA = Probe(T2)
TTACCTCAAACC = F2      GGTTTGAGGTAA = Probe(F2)
TTTCACAACTCC = T3      GGAGTTGTGAAA = Probe(T3)
TTCAATCCACAA = F3      TTGTGGATTGAA = Probe(F3)
TCACTCTCTCAA = T4      TTGAGAGAGTGA = Probe(T4)
TCTTTCTCCTCT = F4      AGAGGAGAAAGA = Probe(F4)
CATCTCACCATC = T5      GATGGTGAGATG = Probe(T5)
AACACTACACAC = F5      GTGTGTAGTGTT = Probe(F5)

Junction sequences (no cross CH):

AAAAAA-TTTTTC = T1T2      ACCACC-TTTTTC = F1T2
AAAAAA-TTACCT = T1F2      ACCACC-TTACCT = F1F2
CTTCCA-TTTCAC = T2T3      CAAACC-TTTCAC = F2T3
CTTCCA-TTCAAT = T2F3      CAAACC-TTCAAT = F2F3
AACTCC-TCACTC = T3T4      CCACAA-TCACTC = F3T4
AACTCC-TCTTTC = T3F4      CCACAA-TCTTTC = F3F4
TCTCAA-CATCTC = T4T5      TCCTCT-CATCTC = F4T5
TCTCAA-AACACT = T4F5      TCCTCT-AACACT = F4F5

Probe(F2) = GGTTTGAGGTAA against the library strand T1-F2-F3-T4-T5:

Across the ligated junctions (no bonding, no misreads):
CCAACC(AAAAAA-TTACCT)(CAAACC-TTCAAT)(CCACAA-TCACTC)(TCTCAA-CATCTC)ACCATC
T1-(T1F2)(F2F3)(F3T4)(T4T5)-T5

At the codewords (yes bonding, good read):
CCAACCAAAAAA-TTACCTCAAACC-TTCAATCCACAA-TCACTCTCTCAA-CATCTCACCATC
T1-F2-F3-T4-T5


Figure 11.  Cooperative DNA Code
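The library construction and the role of the junction sequences can be illustrated with the codewords of Figure 11. The Python sketch below is an assumption-laden simplification: hybridization is replaced by an exact match between a probe and its perfect WC complement, so it says nothing about partial or thermodynamically weak binding. The helper names (`wc_complement`, `library`, `junction`, `read`) are hypothetical.

```python
from itertools import product

T = ["CCAACCAAAAAA", "TTTTTCCTTCCA", "TTTCACAACTCC", "TCACTCTCTCAA", "CATCTCACCATC"]
F = ["AAAAAAACCACC", "TTACCTCAAACC", "TTCAATCCACAA", "TCTTTCTCCTCT", "AACACTACACAC"]
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def wc_complement(strand):
    """Watson-Crick complement: reverse the strand and swap A/T and C/G."""
    return "".join(PAIR[b] for b in reversed(strand))

def library():
    """All 2^5 strands X1 X2 X3 X4 X5 with X_i = T_i or F_i, keyed by label."""
    strands = {}
    for choice in product("TF", repeat=5):
        words = [(T if c == "T" else F)[i] for i, c in enumerate(choice)]
        label = "-".join(f"{c}{i + 1}" for i, c in enumerate(choice))
        strands[label] = "".join(words)
    return strands

def junction(left, right, half=6):
    """Sequence formed across a ligation point: end of left + start of right."""
    return left[-half:] + right[:half]

def read(strand):
    """Report which probes find their exact complement anywhere in the strand."""
    hits = []
    for i in range(5):
        for name, word in (("T", T[i]), ("F", F[i])):
            probe = wc_complement(word)
            if wc_complement(probe) in strand:
                hits.append(f"{name}{i + 1}")
    return "-".join(hits)

lib = library()
print(len(lib))                      # 32
print(junction(T[0], T[1]))          # AAAAAATTTTTC  (T1T2 in Figure 11)
print(read(lib["T1-F2-F3-T4-T5"]))   # T1-F2-F3-T4-T5
```

Under this exact-match model every one of the 32 library strands reads back as its own label, because no codeword occurs at a position that straddles a junction. The coupling constraint described in the text is stronger, since it asks the junction sequences to satisfy the code constraints rather than merely differ from the codewords.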